Decision Tree Models Applied to the Labeling of Text with Parts-of-Speech

Authors

  • Ezra Black
  • Frederick Jelinek
  • John D. Lafferty
  • Robert L. Mercer
  • Salim Roukos
Abstract

1. Introduction

In this paper we describe work which uses decision trees to estimate probabilities of words appearing with various parts-of-speech, given the context in which the words appear. In principle, this approach affords the optimal solution to the problem of predicting the correct sequence of parts-of-speech. In practice, the method is limited by the lack of large, hand-labeled training corpora, as well as by the difficulty inherent in constructing a set of questions to be used in the decision procedure. Nevertheless, decision trees provide a powerful mechanism for tackling the problem of modeling long-distance dependencies.

The following sentence is typical of the difficulties facing a tagging program:

    The new energy policy announced in December by the Prime Minister will guarantee sufficient oil supplies at one price only.

The usual hidden Markov model, trained as described in the last section of this paper, incorrectly labeled the verb "announced" as having the active rather than the passive aspect. If, however, a decision procedure is used to resolve the ambiguity, the context may be queried to determine the nature of the verb as well as its agent. We can easily imagine, for example, that if the battery of available questions is rich enough to include such queries as "Is the previous noun inanimate?" and "Does the preposition 'by' appear within three words of the word being tagged?" then such ambiguities may be probabilistically resolved. Thus it is evident that the success of the decision approach will rely on the questions as well as the manner in which they are asked.

In the experiments described in this paper, we constructed a complete set of binary questions to be asked of words, using a mutual information clustering procedure [2]. We then extracted a set of events from a 2-million word corpus of hand-labeled text. Using an algorithm similar to that described in [1], the set of contexts was divided into equivalence classes using a decision procedure which queried the binary questions, splitting the data based upon the principle of maximum mutual information between tags and questions. The resulting tree was then smoothed using the forward-backward algorithm [6] on a set of held-out events, and tested on a set of previously unseen sentences from the hand-labeled corpus. The results showed a modest improvement over the usual hidden Markov model approach. We present explanations and examples of the results obtained and suggest ideas for obtaining further improvements.

2. Decision Trees

The problem at hand is to predict a tag for a given word in a sentence, taking into consideration the tags assigned to previous words, as well as the remaining words in the sentence. Thus, if we wish to predict tag $t_n$ for word $w_n$ in a sentence $S = w_1, w_2, \ldots, w_N$, then we must form an estimate of the probability $P(t_n \mid t_1, t_2, \ldots, t_{n-1} \text{ and } w_1, w_2, \ldots, w_N)$. We will refer to a sequence $(t_1, \ldots, t_{n-1}; w_1, \ldots, w_N)$ as a history. A generic history is denoted as $H$, or as $H = (H_T, H_W)$, when we wish to separate it into its tag and word components. The set of histories is denoted by $\mathcal{H}$, and a pair $(t, H)$ is called an event. A tag is chosen from a fixed tag vocabulary $V_T$, and words are chosen from a word vocabulary $V_W$. Given a training corpus $E$ of events, the decision tree method proceeds by placing the observed histories into equivalence classes by asking binary questions about them. Thus, a tree is grown with each node labeled by a question $q : \mathcal{H} \to \{\text{True}, \text{False}\}$. The entropy of tags at a
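The splitting criterion described above, choosing at each node the binary question with maximum mutual information between tags and answers, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy events, the Penn-style tag labels (VBN, VBD, NN, and so on), and the two hand-written question functions are hypothetical, whereas the paper builds its questions with a mutual information clustering procedure and trains on a 2-million word corpus.

```python
import math
from collections import Counter


def entropy(tags):
    """Shannon entropy (in bits) of the empirical distribution of a list of tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutual_information(events, question):
    """I(tag; answer) = H(tag) - H(tag | answer) over a list of (tag, history) events."""
    all_tags = [t for t, _ in events]
    yes = [t for t, h in events if question(h)]
    no = [t for t, h in events if not question(h)]
    conditional = sum(
        len(part) / len(events) * entropy(part) for part in (yes, no) if part
    )
    return entropy(all_tags) - conditional


# A history is (previous_tags, words); the word being tagged sits at
# index len(previous_tags). Both questions below are hypothetical examples.
def by_within_three(history):
    """Does 'by' appear within three words of the word being tagged?"""
    prev_tags, words = history
    n = len(prev_tags)
    return "by" in words[max(0, n - 3): n + 4]


def previous_tag_is_noun(history):
    """Was the previous word tagged as a noun (assumed label 'NN')?"""
    prev_tags, _ = history
    return bool(prev_tags) and prev_tags[-1] == "NN"


# Toy events, invented for illustration only (not data from the paper).
events = [
    ("VBN", (("DT", "JJ", "NN", "NN"),
             ("the", "new", "energy", "policy", "announced", "in",
              "December", "by", "the", "Prime", "Minister"))),
    ("VBD", (("DT", "NN"), ("the", "minister", "announced", "a", "policy"))),
    ("VBN", (("DT", "NN"), ("the", "plan", "approved", "by", "voters", "failed"))),
    ("VBD", (("PRP",), ("they", "approved", "the", "plan"))),
]

questions = [by_within_three, previous_tag_is_noun]
best = max(questions, key=lambda q: mutual_information(events, q))
print(best.__name__, round(mutual_information(events, best), 3))
```

On these toy events the "by" question separates the passive from the active readings perfectly, so it would be chosen to label the node. A full tree grower would apply the same selection recursively to the True and False subsets until a stopping criterion is met.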




Publication date: 1992